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Abstract 

We examine the recently introduced NSB estimator of entropies of 
severely undersampled discrete variables and devise a procedure for cal- 
culating the involved integrals. We discover that the output of the estima- 
tor has a well defined limit for large cardinalities of the variables being 
studied. Thus one can estimate entropies with no a priori assumptions 
about these cardinalities, and a closed form solution for such estimates is 
given. 



1 Introduction 

Estimation of functions of a discrete random variable with an unknown probability distri- 
bution using independent samples of this variable seems to be an almost trivial problem 
known to many yet from the high school [|l]]. However, this simplicity vanishes if one con- 
siders an extremely undersampled regime, where K, the cardinality or the alphabet size 
of the random variable, is much larger than N, the number of independent samples of the 
variable. In this case the average number of samples per possible outcome (also called bin 
in this paper) is less than one, relative uncertainty in the underlying probability distribution 
is large, and the usual formulas for estimation of various statistics fail miserably. Then one 
has to use the power of Bayesian statistics to a priori constraint the set of allowable distri- 
butions and thus decrease the posterior error Unfortunately, due to the usual bias-variance 
tradeoff, decreasing the variance this way may lead to an increased bias, i.e., the estimator 
becomes a function of the prior, rather than of the experimental data. 

The situation is particularly bad for inferring the Boltzmann-Shannon entropy, S, one of 
the most important characteristics of a discrete variable. Its frequentist as well as common 
Bayesian estimators have low variances, but high biases that are very difficult to calculate 
(see Ref. |Q| for a review). However, recently ideas from Bayesian model selection iflH 
g[ Hi were used by Nemenman, Shafee, and Bialek to suggest a solution to the problem [^. 
Their method, hereafter called NSB, is robust and unbiased even for severely undersampled 
problems. We will review it and point out that it is equivalent to finding the number of yet 
unseen bins with nonzero probability given K, the maximum cardinality of the variable. 
While estimation of K by model selection techniques will not work, we will show that the 
method has a proper limit K ^ co. Thus one should be able to calculate entropies of 
discrete random variables even without knowing their cardinality. 



2 Summary of the NSB method 



In Bayesian statistics, one uses Bayes rule to expresses posterior probability of a probability 
distribution q = {qi}, i = 1 . . . K,of a discrete random variable with a help of its a priori 
probability, 7^(q). Thus if Ui identical and independent samples from q are observed in bin 
i, such that J2i=i — '^hen the posterior, P(q|n), is 

P(q|n) = ^(^^ ^ ,nLrnq) . 

Following Ref. [Q], we will focus on popular Dirichlet family of priors, indexed by a (hy- 
per)parameter (3: 

Here (5-function and Z{(3) enforce normalizations of q and 'Pp{q) respectively, and T 
stands for Euler's F-function. These priors are common in applications ^ since they, 
as well as the data term, P(n|q), are of a multinomial structure, which is analytically 
tractable. For example, in Ref. |M] Wolpert and Wolf calculated posterior averages, here 
denoted as (...)/}, of many interesting quantities, including the distribution itself, 

{1^)P='^^^ K=KP, (3) 
I\ + K 

and the moments of its entropy, which we will not reprint here. 

As suggested by Eq. (|]), Dirichlet priors add extra /3 sample points to each possible bin. 
Thus for f3 ^ N/K the data is unimportant, and P(q|n) is dominated by the distributions 
close to the uniform one, q « 1/K. The posterior mean of the entropy is then strongly 
biased upwards to its maximum possible value of S'max — In Similarly, for (3 ^ N/ K, 
distributions in the vicinity of the frequentist's maximum likelihood estimate, q — n/N, 
are important, and {S)p has a strong downward bias [||. 

In Ref. [^, Nemenman et al. traced this problem to the properties of the Dirichlet family: 
its members encode reasonable a priori assumptions about q, but not about 5(q). Indeed, it 
turns out that a priori assumptions about the entropy are extremely biased, as may be seen 
from its following a priori moments. 

m = {s\N=o)p = M^^ + ^)-M0 + i), (4) 

<j\P) = {{SS)'\n=o)p = ^Vi(/3 + l)-^i('* + l), (5) 

where t/jmix) = {d/dx)"^^^ lnr(a::) are the polygamma functions. ^(/3) varies smoothly 
from for /3 = 0, through 1 for (3 k, 1/K, and to hiK for j3 ^ oo. a{f3) scales as 
1 / ^/K for almost all f3 (see Ref. ^ for details). This is negligibly small for large K. Thus 
q that is typical in 'P/3{q) usually has its entropy extremely close to some predetermined 
/3-dependent value. It is not surprising then that this bias persists even after N < K data 
are collected. 

The NSB method suggests that to estimate entropy with a small bias one should not look 
for priors that seem reasonable on the space of q, but rather the a priori distribution of 
entropy, V{S{q)), should be flattened. This can be done approximately by noting that 
Eqs. ||) ensure that, for large K, V{S) is almost a (5-function. Thus a prior that enforces 

'in this paper the unit of entropy is nat. Thus all logarithms are natural. 



integration over all non-negative values of /?, which correspond to all a priori expected 
entropies between and In K, should do the job of eliminating the bias in the entropy 
estimation even for N K. While there are probably other options, Ref. [pQ centered on 
the following prior, which is a generalization of Dirichlet mixture priors JlOp to an infinite 
mixture: 

V{c;(3) = ^5[l-f: q)j n ^t"^^ ■ (6) 

Here Z is again the normalizing coefficient, and the term d^/dfi ensures uniformity for the 
a priori expected entropy, ^, rather than for (i. A non-constant prior on /3, V{/3), may be 
used if sufficient reasons for this exist, but we will set it to one in all further developments. 

Inference with the prior, Eq. (||), involves additional averaging over f] (or, equivalently, 0, 
but is nevertheless straightforward. The a posteriori moments of the entropy are 



where the posterior density is (7) 



Nemenman et al. explain why this method should work using the theory of Bayesian model 
selection [|3} Q |], g|. All possible probability distributions, even those that fit the data 
extremely badly, should be included in the posterior averaging. For models with a larger 
volume in q space, the number of such bad q's is greater, thus the probability of the model 
decreases. Correspondingly, such contributions from the phase space factors are usually 
termed Occam razor because they automatically discriminate against bigger, more complex 
models. If the maximum likelihood solution of a complex model explains the data better 
than that of a simpler one,| then the total probability, a certain combination of the maximum 
likelihood and the Occam factors, has a maximum for some non-trivial model, and the 
sharpness of the maximum grows with N. In other words, the data selects a model which 
is simple, yet explains it well. 

In the case of Eq. (Q), we can view different values of (3 as different models. The smaller 
(3 is, the closer it brings us to the frequentist's maximum likelihood solution, so the data 
gets explained better However, as there is less smoothing [cf. Eq. (^], smaller /3 results 
in the larger phase space. Thus, according to Ref. ||7j, one may expect that the integrals in 
Eq. (^) will be dominated by some /?*, appropriate smoothing will be sharply selected, and 
— w {■ ■ ■)pt. In the current paper we will investigate whether a maximum of the integrand 
in Eq. (Q), indeed, exists and will study its properties. The results of the analysis will lead 
us to an extension and a simplification of the NSB method. 



3 Calculation of the NSB integrals 

We will calculate integrals in Eq. using the saddle point method. Since the moments 
of S do not have N dependence, when N is large only the F-terms in p are important for 
estimating the position of the saddle and the curvature around it. We write 

p(e|n) = 7'(/3(0)exp[-£(n,/?,if)] , (9) 

C{n,f3,K) = -^lnF(/3 + n,)+i^lnr(/5)-lnr(K)+lnF(K + iV). (10) 



^This is usually achieved by requiring that models are nested, that is, all q's possible in the simpler 
model are possible in the complex one, but not vice versa. 



Then the saddle point (equivalently, the maximum likelihood) value, k* = Kf3*, solves the 
following equation obtained by differentiating Eq. (|lO|). 

Mn^+n-^Mn+M'^*)-M^*+N) = o, (11) 

i 

where we use Km to denote the number of bins that have, at least, m counts. Note that 
N > Ki > K2 > ■ ■ ■ ■ 

We notice that if K N, and if there are at least a few bins that have more that one datum 
in them, i.e., Ki < N, then the distribution the data is taken from is highly non-uniform. 
Thus the entropy should be much smaller than its maximum value of S'max- Since for any 
f3 = 0(1) the entropy is extremely close to S'max (cf. Ref. [^), small entropy may be 
achievable only if /3* ^ as K 00. Thus we will look for 

K* = K0+ -^Kl + + . . . , (12) 

where none of kj depends on K. Plugging Eq. (|l2|) into Eq. (^l]), after a little algebra we 
get the first few terms in the expansion of k*: 

Mn-i) - ■>po{l) 



= y . , 2 ~ — r— -^TTT: (13) 

^0 



K2 



(14) 



Ki/kI - Vi(ko) + i^ii^o + N) 
and the zeroth order term solves the following algebraic equation 

^ ^ ijoino + N) - M^^o) ■ (15) 

If required, more terms in the expansion can be calculated, but for common applications K 
is so big that none are usually needed. 

We now focus on sol ving Eq. (Ob. For ^ and > 0, the r. h. s. of the equation is 
approximately I/kq I HI- On the other hand, for kq — > 00, it is close to N/no. Thus if 
N — Ki, that is, the number of coincidences among different data, A = iV — Ki, is zero, 
then the 1. h. s. always majorates the r. h. s., and the equation has no solution. If there are 
coincidences, a unique solution exists, and the smaller A is, the bigger kq is. Thus we may 
want to search for kq ~ 1/A + 0(A°). 

Now it is useful to introduce the following notation: 

JV-l i 
m=0 

where each of /jy's scales as N°. Using standard results for polygamma functions JTlj], we 
rewrite Eq. ( p3| ) as 

i^=y(_l).J^ (17) 

Here we introduced the relative number of coincidences, 6 = A/N. Combined with the 
previous observation, Eq. ( |l7| ) suggests that we look for kq of the form 

no^N(^ + bo + b,S+... ] , (18) 



where each of bj's is independent of 6 and scales as N^. 



Substituting this expansion for kq into Eq. ( 17), we see that it is self-consistent, and 



, , N -1 

'-1 = M^)^^^^ (19) 
_ /iv(2) _ -2N+l 



^ .am , /Ar(3)_ jv^-jv-2 



Again, more terms can be calculated if needed. 

This expresses the saddle point value /?* (or k*, or ^*) as a power series in l/iiT and 6. 
In order to complete the evaluation of integrals in Eq. (^, we now need to calculate the 
curvature at this saddle point. Simple algebra results in 



«(/3*) 



Notice that the curvature does not scale as a power of N as was suggested in Ref. [0. Our 
uncertainty in the value of ^* is determined to the first order only by coincidences. One 
can understand this by considering a very large K with most of the bins having negligible 
probabilities. Then counts of = 1 are not informative for entropy estimation, as they can 
correspond to massive bins, as well as to some random bins from the sea of negligible ones. 
However, coinciding counts necessarily signify an important bin, which should influence 
the entropy estimator. Note also that to the first order in 1/ K the exact positioning of 
coincidences does not matter: a few coincidences in many bins or many coincidences in a 
single one produce the same saddle point and the same curvature around it, provided that 
A stays the same. While this is an artifact of our choice of the underlying prior Vp (q) and 
may change in a different realization of the NSB method, this behavior parallels famous 
Ma's entropy estimator, which is also coincidence based Jl2|]. 

In conclusion, if the number of coincidences, not N, is large, then a proper value for (3 is 
selected, and the variance of entropy is small. Then the results of this section transform cal- 
culations of complicated integrals in Eq. to pure algebraic operations. This analysis has 
been used to write a general purpose software library for estimating entropies of discrete 
variables. The library is available from the author. 



4 Choosing a value for Kl 

A question is in order now. If N <^ K, the regime we are mostly interested in, then 
the number of extra counts in occupied bins, Ki(3, is negligible compared to the number of 
extra counts in empty bins, {K — Ki)(3 « Kp. Then Eqs. (H, |]) tell us that selecting (3 (that 
is, integrating over it) means balancing N, the number oractual counts versus k — Kp, 
the number of pseudocounts, or, equivalently, the scaled number of unoccupied bins. Why 
do we vary the pseudocounts by varying /3? Can we instead use Bayesian model selection 
methods to set K7 Indeed, not having a good handle on the value of K is usually one of 
the main reasons why entropy estimation is difficult. Can we circumvent this problem? 

To answer this, note that smaller K leads to a higher maximum likelihood value since the 
total number of pseudocounts is less. Unfortunately, smaller K also means smaller volume 
in the distribution space since there are fewer bins, fewer degrees of freedom, available. 
As a result, Bayesian averaging over K will be trivial: the smallest possible number of 
bins, that is no empty bins, will dominate. This is very easy to see from Eq. (jsj): only the 
first ratio of F-functions in the posterior density depends on K, and it is maximized for 



K = Ki. Thus straight-forward selection of the value of K is not an option. However, in 
the next Section we will suggest a way around this hurdle. 



5 Unknown or infinite K 



When one is not sure about the value of K, it is usually because its simple estimate is 
intolerably large. For example, consider measuring entropy of £-gramms in printed English 
[ pj| ] using an alphabet with 29 characters: 26 different letters, one symbol for digits, one 
space, and one punctuation mark. Then even for t as low as 7, a naive value for K is 
29^ ^ lO^*'. Obviously, only a miniscule fraction of all possible 7-gramms may ever 
happen, but one does not know how many exactly. Thus one is forced to work in the space 
of full cardinality, which is ridiculously undersampled. 

A remarkable property of the NSB method, as follows from the saddle point solution in 
Sec. H, is that it works even for finite N and extremely big K (provided, of course, that there 
are coincidences). Moreover, if K — > oo, the method simplifies since then one should only 
keep the first term in the expansion, Eq. (|l2[). Even more interestingly, for every (3^ 1/ K 
the a priori distribution of entropy becomes an exact delta function since the variance of 
entropy drops to zero as 1/ K, see Eq. (^. Thus the NSB technique becomes more precise 
as K increases. So the solution to the problem of unknown cardinality is to use an upper 
bound estimate for K: it is much better to overestimate K than to underestimate it. If 
desired, one may even assume that K ^ coio simplify the calculations. 

It is important to understand which additional assumptions are used to come to this con- 
clusion. How can a few data points specify entropy of a variable with potentially infinite 
cardinality? As explained in Ref. a typical distribution in the Dirichlet family has a 
very particular rank ordered (Zipf) plot: the number of bins with the probability mass less 
than some q is given by an incomplete i?-function, /, 

where B stand for the usual complete i?-function. NSB fits for a proper value of /3 (and 
K = KP) using bins with coincidences, the head of the rank ordered plot. But knowing 
(3 immediately defines the tails, where no data has been observed yet, and the entropy can 
be calculated. Thus if the Zipf plot for the distribution being studied has a substantially 
longer tail than allowed by Eq. (^3]), then one should suspect the results of the method. For 
example, NSB will produce wrong estimates for a distribution with qi = 0.5, q2, ■ ■ - qK = 
0.5/{K -l),andK ^ oo. 

With this caution in mind, we may now try to calculate the estimates of the entropy and its 
variance for extremely large K. We want them to be valid even if the saddle point analysis 
of Sec. H fails because A is not large enough. In this case /3* 0, but k* — Kj3* is 
some ordinary number The range of entropies now is < 5* < InX ^ oo, so the prior 
on S produced by 7'(q; (3) is (almost) uniform over a semi-infinite range and thus is non- 
normalizable. Similarly, there is a problem normalizing 7'^(q), Eq. {%. However, as is 
common in Bayesian statistics, these problems can be easily removed by an appropriate 
limiting procedure, and we will not pay attention to them in the future. 

When doing integrals in Eq. (^, we need to find out how (S'(n)) p depends on ^(/3). In the 
vicinity of the maximum of p, using the formula for (5(n))^ from Ref. we get 



[{S{n)),-m] 



+ ^Uj=^(^'if)- ^24) 



The expression for the second moment is similar, but complicated enough so that we chose 
not to write it here . The main point is that for K oo, S — A/N 0, and k in the vicin- 
ity of K*, the posterior averages of the entropy and its square are almost indistinguishable 
from ^ and the a priori averages. Since now we are interested in small A (Otherwise we 
can use the saddle point analysis), we will use ^™ instead of {S"^)i3 in Eq. (M). The error 
of such approximation is 0((5, = 0(-^,-^). 

Now we need to slightly transform the Lagrangian, Eq. (|lO|). First, we drop terms that do 
not depend on k since they appear in the numerator and denominator of Eq. and thus 
cancel. Second, we expand around 1/K — Q. This gives 

ni>l ^ 

C{n,K,K) = - ^ \iiT{n,) - KiluK -\nT{K) + \nT{K + N) + 0{—) . (25) 

i 

We note that n is large in the vicinity of the saddle if 5 is small and N is large, cf. Eq. (jig). 
Thus, by definition of ^-functions, lnr(K + N) - IiiT{k) w NiPq{k) + N'^iJji{k)J2. 
Further, ^o{k) ~ InK, and V-'i('«) ~ 1/k [pi]]. Finally, since 'i/'o(l) — —C-y, where is 
the Euler's constant, Eq. (^ says that ^ — C-y « In k. Combining all this, we get 

£(n, K,K)^-Y, lnr(n.) + A(^ ^ C^) + — exp(C^ - C) , (26) 

i 

where the w sign means that we are working with precision O (-^ , -^j . 
Now we can write: 

S « C'7 - — In / e-^d^ , (27) 



d 



dA 

2 „ln K 







The integral involved in these expressions can be easily calculated by substituting exp(C-y — 
£,) = T and replacing the limits of integration l/_ft'exp(C-y) < r < exp(C^) by < 
T < 00. Such replacement introduces errors of the order (l/K)^ at the lower limit and 
6^ exp(~l/5^) at the upper limit. Both errors are within our approximation precision if 
there is, at least, one coincidence. Thus 

In A' . j^2\-'^ 

e'^dC « r(A) j . (29) 

Finally, substituting Eq. ( p9[ ) into Eqs. ( p7| , p8[ ) we get for the moments of the entropy 

S w (C^-ln2) + 21n7V-Vo(A), (30) 

(SS)^ « ^i(A). (31) 

These equations are valid to zeroth order in 1/K and They provide a simple, yet 

nontrivial, estimate of the entropy that can be used even if the cardinality of the variable 
is unknown. Note that Eq. ( |3l| ) agrees with Eq. ( |22[ ) since, for large A, jji jA) w 1/A. 
Interestingly, Eqs. (53, BTl) carry a remarkable resemblance to Ma's method 1 12 1. 



6 Conclusion 



We have further developed the NSB method for estimating entropies of discrete random 
variables. The saddle point of the posterior integrals has been found in terms of a power 



series inl/K and S. It is now clear that validity of the saddle point approximation depends 
not on the total number of samples, but only on the coinciding ones. Further, we have 
extended the method to the case of infinitely many or unknown number of bins and very 
few coincidences. We obtained closed form solutions for the estimates of entropy and its 
variance. Moreover, we specified an easily verifiable condition (extremely long tails), under 
which the estimator is not to be trusted. To our knowledge, this is the first estimator that can 
boast all of these features simultaneously. This brings us one more step closer to a reliable, 
model independent estimation of statistics of undersampled probabiUty distributions. 
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